Abstract — Current machine learning and pattern recognition
methods need large datasets to produce efficient and accurate
recognizers. The lack of a large standard Arabic dataset is one
of the main challenges facing research in this area. This paper
presents the Arabic datasets collected and annotated by
SUST-ALT (Sudan University of Science and Technology -
Arabic Language Technology group) to help fill this gap. The
datasets comprise numerals, isolated Arabic letters, and Arabic
names. They include offline as well as online data. The paper
also describes some published results and future work.

Index Terms — Arabic language recognition, dataset,
machine learning, pattern recognition.

I. INTRODUCTION

Considerable effort has gone into building efficient automatic
reading systems that would free us from the keyboard as the
main means of entering data into computers. However, we are
still far from having robust solutions to this problem. All
available OCR technologies work only in restricted
environments. To make sure that a human being, not an
electronic agent, is entering data, web-based data entry systems
display rough, distorted writing that the person must read and
retype. This is strong evidence that electronic reading systems
are far from competing with the human reading faculty. This is
the case for Latin scripts; for the Arabic language we lag
behind by at least one decade. We therefore need to keep doing
research in this area for all languages, and for languages like
Arabic we should increase and intensify our research work.

This paper describes the datasets collected and
preprocessed by SUST ALT (Sudan University of Science
and Technology - Arabic Language Technology group).

Section II gives some basic information about the Arabic
language and its alphabet. Section III contains four
subsections, each describing one dataset. Section IV contains
the discussion, and Section V concludes the paper.

II. ARABIC LANGUAGE

Arabic is the official language of more than twenty
countries and the mother tongue of more than 300 million
people [1]. Arabic is one of the six official languages of the
United Nations. Unlike Latin script, Arabic is written from
right to left. The Arabic script is also used to write other
languages, such as Persian. Moreover, Arabic script is the
former script of the Turkish language, as it was the script of
the Ottoman Empire. The Ottoman Empire produced millions of
written documents. Although most of these documents are
archived in Turkey, a considerable part of them is distributed
among several other countries. Digitizing these documents and
building an efficient retrieval system for them is still a big
research challenge [2].

Arabic script is cursive. However, 6 of the 28 letters are
non-cursive: they cannot be connected to the succeeding
letter. Thus an Arabic word may be decomposed into two or
more sub-words. Each Arabic letter may have up to 4
different shapes; Table I shows some Arabic letters with their
different shapes.
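
The decomposition of a word into sub-words can be illustrated with a short Python sketch. It is not part of the SUST ALT tools: the function name split_subwords is hypothetical, and the sketch assumes the six non-connecting letters are alif, dal, thal, ra, zay and waw, ignoring hamza-carrying variants and diacritics.

```python
from typing import List

# Assumed set of the six letters that never join to the following letter:
# alif, dal, thal, ra, zay, waw (variants with hamza are ignored here).
NON_CONNECTING = set("ادذرزو")


def split_subwords(word: str) -> List[str]:
    """Split a word into sub-words; a sub-word ends after a non-connecting letter."""
    parts: List[str] = []
    current = ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTING:
            # The next letter cannot attach to this one, so close the sub-word.
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


# "محمد" (Mohamed) has a non-connecting letter only at its end.
print(len(split_subwords("محمد")))
# "سودان" (Sudan) contains waw, dal and alif in the middle of the word.
print(len(split_subwords("سودان")))
```

A recognizer that segments handwriting at pen lifts or white gaps will tend to see these sub-words, rather than whole words, as its connected units.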

Nowadays, there are three categories of the Arabic
language [3]:

Classical Arabic: the ancient language and the language
of the Quran.

Modern Standard Arabic: the universal language of the
Arabic-speaking world, which is understood by all speakers and
used by the media and the academic and official communities.

Local Arabic dialects: there are many local dialects,
which contain many words and constructs understood by the
local people only.


TABLE I

Some Arabic letters and their different writing shapes

Name    Isolated    Initial    Medial    Final
Baa     ب           بـ         ـبـ       ـب
Mym     م           مـ         ـمـ       ـم

III. SUST ALT DATASETS

It may be a good idea to have a huge annotated dataset that
represents all types of writing to train and test new recognition
systems. However, such a dataset would not be useful for
small-scale research projects. For this reason we decided to
build several datasets that can be used for studying,
investigating, training, and testing small proposals as well as
big ones. The rest of this section outlines these datasets. All of
them are freely available to researchers at www.sustech.edu.

A. Hindi numeral dataset 

In the first stages of OCR work for a specific language, we
usually try to recognize that language's digits. Although many
Hindi digit datasets now exist, we built our own digit dataset
to learn some lessons from doing so. It is also important for us
to have a dataset from our own environment, as it could differ
from other environments. Table II shows the basic shape of
each digit and its frequency in the dataset.


TABLE II

Hindi numeral dataset samples and sizes

Digit    Hindi    Samples    Frequency
0        ٠        [image]    3680
1        ١        [image]    [illegible]
2        ٢        [image]    3678
3        ٣        [image]    [illegible]
4        ٤        [image]    [illegible]
5        ٥        [image]    [illegible]
6        ٦        [image]    [illegible]
7        ٧        [image]    [illegible]
8        ٨        [image]    3702
9        ٩        [image]    3709
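
When annotation files store the Hindi digit shapes themselves, they have to be mapped to integer class labels before training. The following Python sketch shows one way to do this with the standard unicodedata module. The helper names digit_label and to_hindi are hypothetical, and the sketch assumes the labels are stored as Unicode characters in the range U+0660 to U+0669, which may not match the actual annotation format of the dataset.

```python
import unicodedata

# Hindi (Arabic-Indic) digits occupy the Unicode range U+0660..U+0669.
HINDI_DIGITS = "".join(chr(0x0660 + d) for d in range(10))


def digit_label(glyph: str) -> int:
    """Return the integer class label (0-9) for one Hindi digit character."""
    return unicodedata.digit(glyph)


def to_hindi(number: int) -> str:
    """Render a non-negative integer using Hindi digit shapes."""
    return "".join(HINDI_DIGITS[int(d)] for d in str(number))


# Print the glyph-to-label mapping for all ten classes.
for glyph in HINDI_DIGITS:
    print(glyph, "->", digit_label(glyph))
```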


B. Isolated Arabic letters dataset 

As illustrated in Section II, Arabic letters have many
shapes. Nevertheless, it is useful to have a dataset of isolated
letters, as some applications use isolated letters. In addition,
this dataset helps in preliminary investigation of the language.


TABLE III

Some samples from the letters dataset

Letter name    Typed form    Samples
Alif           ا             [image]


C. Arabic names dataset 

The source of this dataset is the SUST graduation
certificate application form (see Fig. 2). In this form, the
student writes his or her name up to the third grandfather.
These documents were collected from the SUST registrar
office, scanned, and segmented. Fig. 1 contains samples of the
name Mohamed “محمد” in this dataset.

Fig. 1: Samples of the name Mohamed “محمد” in the names dataset

D. Online Arabic dataset

Two online Arabic datasets have been established
(SUSTOLAH - Sudan University of Science and Technology
Online Arabic Handwritten data). The first of these datasets
(the letters dataset) contains 7827 samples of online
handwriting of isolated Arabic letters. The second (the
persons' names dataset) contains 3097 samples of online
Arabic handwriting of persons' names. More than one hundred
and fifty writers from various higher education institutions in
Sudan contributed to the collection of these samples. A list of
twenty persons' names was specified for collecting the samples
of the names dataset, while the basic Arabic letters were used
for collecting the samples of the letters dataset. Table 1 shows
part of the names dataset; it contains eight rows for eight
different Arabic names, with the variation in their number of
strokes.

In comparison with ADAB, an online Arabic dataset
known as a standard benchmark in the ICDAR competition of
2009 [7], SUSTOLAH has the following characteristics:

• SUSTOLAH includes handwritten samples of isolated
Arabic letters as well as handwritten samples of
cursive Arabic words. This property gives
SUSTOLAH importance for pedagogical research.

• The datasets have an aspectual representation of the
pen tips and strokes that form the handwriting.

• SUSTOLAH has a software tool for collecting online
Arabic handwriting as well as a verification tool.
Researchers can therefore create their own datasets
with these tools.
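
To make the idea of online handwriting data concrete, the following Python sketch models a sample as a list of strokes, each stroke being the sequence of pen-tip points between pen-down and pen-up. This is only an illustration under stated assumptions: the class names Stroke and OnlineSample, the field names, and the toy samples are hypothetical and do not describe the actual SUSTOLAH file format.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # assumed (x, y) pen-tip position


@dataclass
class Stroke:
    """One pen-down to pen-up trace."""
    points: List[Point] = field(default_factory=list)


@dataclass
class OnlineSample:
    """One handwritten letter or name captured online (hypothetical layout)."""
    label: str
    writer_id: int
    strokes: List[Stroke] = field(default_factory=list)


def stroke_count_histogram(samples: List[OnlineSample], label: str) -> Counter:
    """Count how many samples of `label` were written with 1, 2, 3, ... strokes."""
    return Counter(len(s.strokes) for s in samples if s.label == label)


# Toy data, invented for illustration only.
toy_samples = [
    OnlineSample("name_a", 1, [Stroke([(0.0, 0.0), (1.0, 1.0)]), Stroke([(2.0, 0.0)])]),
    OnlineSample("name_a", 2, [Stroke([(0.0, 0.0)]), Stroke([(1.0, 0.0)]), Stroke([(2.0, 0.0)])]),
    OnlineSample("name_a", 3, [Stroke([(0.0, 0.0)]), Stroke([(1.0, 1.0)])]),
    OnlineSample("name_b", 1, [Stroke([(0.0, 0.0)])]),
]

print(dict(stroke_count_histogram(toy_samples, "name_a")))
```

Grouping samples of one name by their number of strokes in this way yields the kind of per-name statistics summarized in Table 1.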

IV. DISCUSSION AND SOME PUBLISHED RESULTS

These datasets are newly established and await extensive
machine learning and pattern recognition research. However,
some research using them has already been published.
Jadeed et al. designed and tested a Support Vector Machine
classifier for the digits dataset [4]. The accuracy of this
classifier is 89%. The digits with poor results are zero,
because it resembles noise, and two and three, because they
are very similar in shape. Balola et al. designed and tested a
multi-layer perceptron for isolated letter classification [5,8].
The main result of this work is that the feature causing the
main challenge for Arabic letter classification is the use of
dots to differentiate similar letters, like dal “د” and thal “ذ”.
Ali et al. designed and tested a holistic classifier based on a
probabilistic neural network to classify Arabic names. Their
experiments show that with a high rejection rate the
recognition rate of this classifier is also very high [6]. The
online dataset is the newest one; however, some interesting
results for it are published in [9,10].
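
The link between rejection rate and recognition rate mentioned above can be sketched in a few lines of Python. This is a generic illustration, not the evaluation procedure of [6]: the function name evaluate_with_rejection, the confidence threshold, and the toy predictions are all assumptions. A classifier that refuses to answer when its confidence is low is scored only on the samples it accepts, so raising the threshold usually raises the recognition rate while more samples are rejected.

```python
from typing import List, Tuple

Prediction = Tuple[str, str, float]  # (predicted label, true label, confidence)


def evaluate_with_rejection(predictions: List[Prediction],
                            threshold: float) -> Tuple[float, float]:
    """Return (recognition rate on accepted samples, rejection rate)."""
    accepted = [(p, t) for p, t, conf in predictions if conf >= threshold]
    rejection_rate = (len(predictions) - len(accepted)) / len(predictions)
    if not accepted:
        return 0.0, rejection_rate
    correct = sum(1 for p, t in accepted if p == t)
    return correct / len(accepted), rejection_rate


# Toy predictions, invented for illustration only.
toy_predictions = [
    ("a", "a", 0.9),
    ("b", "b", 0.8),
    ("c", "a", 0.4),
    ("d", "d", 0.6),
    ("e", "b", 0.3),
]

for threshold in (0.0, 0.5):
    recognition, rejection = evaluate_with_rejection(toy_predictions, threshold)
    print(threshold, recognition, rejection)
```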

V. CONCLUSION 

SUST ALT is a research group that works in Arabic language
technology. One of its current interests is Arabic
handwriting recognition. This paper describes five datasets
established by this group to support research in machine
learning and pattern recognition in general and Arabic
recognition in particular. The data comprise numerals,
letters, and names datasets.

Table 1: Statistics for eight Arabic names. The table shows the number of samples for each name as well as the different numbers of strokes for the name and their statistics.

[Table body not recoverable. Columns: Name; Number of Patterns which take a given number of Strokes (1 and up); Total Number of Patterns]

Fig. 2: Part of the application form used in the names dataset collection